databases. Most programs used in this book are open-source Unix/Linux-based programs.
Others can be used in Anaconda environments.
Chapter 1 discusses sequencing data acquisition from NGS technologies and databases,
the FASTQ file format, and Phred base call quality. The chapter covers quality assessment
of FASTQ files and read quality metrics in some detail so that readers can diagnose
potential problems in raw data and learn how to fix them before analysis.
Chapter 2 discusses read alignment/mapping to reference genomes. The strategies of
both reference genome indexing algorithms and read mapping algorithms are discussed in
detail with illustrations so that readers can understand how the mapping process works,
the different indexing and alignment algorithms currently used, and which aligners are
good for RNA sequencing applications. The chapter discusses indexing and searching
algorithms like suffix trees, suffix arrays, the Burrows-Wheeler Transform (BWT), the
FM-index, and hashing, which are the algorithms used by aligners. The chapter then
discusses the mapping process and aligners like BWA, Bowtie, STAR, etc. The SAM/BAM file
format is discussed in detail so that the reader can understand how alignment information
is stored in fields in the SAM/BAM file. Finally, the chapter discusses the manipulation of
alignments in SAM/BAM files using Samtools programs for different purposes, including
SAM to BAM conversion, alignment sorting, indexing BAM files, extracting alignments of
a chromosome or a specific region, filtering and counting alignments, removing duplicate
reads, and generating descriptive statistics.
Chapter 3 discusses de novo genome assembly and de novo assembly algorithms, including
the greedy algorithm, overlap-consensus graphs, and de Bruijn graphs. Quality assessment
of the assembled genome is discussed through two approaches: a statistical approach
and an evolutionary approach.
Chapter 4 covers variant calling (SNPs and InDels) in detail. The introduction of this
chapter discusses variants, the variant call format (VCF), and the general workflow of
variant calling. The chapter then discusses both consensus-based variant calling and
haplotype-based variant calling, with example callers from each group, including BCFtools,
FreeBayes, and the GATK best practice variant calling pipelines. Finally, the chapter
discusses variant annotation and prioritization, and annotation programs including SIFT,
SnpEff, and ANNOVAR.
Chapter 5 discusses RNA-Seq data analysis. The introduction includes RNA-Seq basics
and applications. The chapter then discusses the steps of the RNA-Seq analysis workflow,
including data acquisition, read alignment, alignment quality control, quantification,
RNA-Seq data normalization, statistical modeling and differential expression analysis,
the use of R packages for differential analysis, and visualization of RNA-Seq data.
Chapter 6 covers ChIP-Seq data analysis. It discusses in detail the workflow of ChIP-Seq
data analysis, including data acquisition, quality control, read mapping, peak calling,
visualizing peak enrichment and peak distribution, peak annotation, peak functional
analysis, and motif discovery.
Chapter 7 discusses targeted gene metagenomic data analysis (amplicon-based microbial
analysis) for environmental and clinical samples. The chapter covers raw data preprocessing